{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 上下文块标题（CCH）\n",
    "\n",
    "通过在生成响应之前检索相关外部知识，检索增强生成（RAG）提高了语言模型的事实准确性。然而，标准的分块往往丢失重要的上下文，使得检索效果不佳。\n",
    "上下文块标题（CCH）通过在每个块嵌入之前添加高级上下文（如文档标题或部分标题）来增强 RAG。这提高了检索质量，并防止了不相关的响应。\n",
    "\n",
    "------\n",
    "实现步骤：\n",
    "- 数据采集：从 PDF 中提取文本\n",
    "- **带上下文标题的块分割：提取章节标题（或使用模型为块生成标题）并将其添加到块的开头。**\n",
    "- 嵌入创建：将文本块转换为数值表示\n",
    "- 语义搜索：根据用户查询检索相关块\n",
    "- 回答生成：使用语言模型根据检索到的上下文生成回答。\n",
    "- 评估：使用评估数据集评估模型性能。"
   ],
   "id": "4d83a2679778fe83"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:27:59.569889Z",
     "start_time": "2025-04-23T11:27:58.483473Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import fitz\n",
    "import os\n",
    "import json\n",
    "import numpy as np\n",
    "from tqdm import tqdm\n",
    "from openai import OpenAI\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv()"
   ],
   "id": "de4a41b11d1c0538",
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "python-dotenv could not parse statement starting at line 1\n",
      "python-dotenv could not parse statement starting at line 2\n",
      "python-dotenv could not parse statement starting at line 3\n",
      "python-dotenv could not parse statement starting at line 4\n",
      "python-dotenv could not parse statement starting at line 5\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 1
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:27:59.884506Z",
     "start_time": "2025-04-23T11:27:59.593611Z"
    }
   },
   "cell_type": "code",
   "source": [
    "client = OpenAI(\n",
    "    base_url=os.getenv(\"LLM_BASE_URL\"),\n",
    "    api_key=os.getenv(\"LLM_API_KEY\")\n",
    ")\n",
    "llm_model = os.getenv(\"LLM_MODEL_ID\")\n",
    "embedding_model = os.getenv(\"EMBEDDING_MODEL_ID\")\n",
    "\n",
    "pdf_path = \"../../data/AI_Information.en.zh-CN.pdf\""
   ],
   "id": "7d59f1d452ca2764",
   "outputs": [],
   "execution_count": 2
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 提取文本并识别章节标题\n",
    "\n",
    "从 PDF 中提取文本，同时识别章节标题（块的可能标题）"
   ],
   "id": "2b302f42a7b1a47"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:00.138412Z",
     "start_time": "2025-04-23T11:28:00.132180Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def extract_text_from_pdf(pdf_path):\n",
    "    \"\"\"\n",
    "    从 PDF 文件中提取文本，并打印前 `num_chars` 个字符。\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file.\n",
    "\n",
    "    Returns:\n",
    "    str: Extracted text from the PDF.\n",
    "    \"\"\"\n",
    "    # 打开 PDF 文件\n",
    "    mypdf = fitz.open(pdf_path)\n",
    "    all_text = \"\"  # 初始化一个空字符串以存储提取的文本\n",
    "\n",
    "    # Iterate through each page in the PDF\n",
    "    for page_num in range(mypdf.page_count):\n",
    "        page = mypdf[page_num]\n",
    "        text = page.get_text(\"text\")  # 从页面中提取文本\n",
    "        all_text += text  # 将提取的文本追加到 all_text 字符串中\n",
    "\n",
    "    return all_text  # 返回提取的文本"
   ],
   "id": "6a9ba6097041c7d0",
   "outputs": [],
   "execution_count": 3
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## 文本分块与上下文标题\n",
    "\n",
    "为了提高检索效率，我们使用LLM模型为每个片段生成描述性标题"
   ],
   "id": "ae49625d7dfc0096"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:00.151670Z",
     "start_time": "2025-04-23T11:28:00.147224Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def generate_chunk_header(chunk):\n",
    "    \"\"\"\n",
    "    使用 LLM 为给定的文本块生成标题/页眉\n",
    "\n",
    "    Args:\n",
    "        chunk (str): T要总结为标题的文本块\n",
    "        model (str): 用于生成标题的模型\n",
    "\n",
    "    Returns:\n",
    "        str: 生成的标题/页眉\n",
    "    \"\"\"\n",
    "    # 定义系统提示\n",
    "    system_prompt = \"为给定的文本生成一个简洁且信息丰富的标题。\"\n",
    "\n",
    "    # 根据系统提示和文本块生成\n",
    "    response = client.chat.completions.create(\n",
    "        model=llm_model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": chunk}\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    # 返回生成的标题/页眉，去除任何前导或尾随空格\n",
    "    return response.choices[0].message.content.strip()"
   ],
   "id": "16d2708afeb1e6bc",
   "outputs": [],
   "execution_count": 4
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:00.165301Z",
     "start_time": "2025-04-23T11:28:00.157517Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def chunk_text_with_headers(text, n, overlap):\n",
    "    \"\"\"\n",
    "    将文本分割为较小的片段，并生成标题。\n",
    "\n",
    "    Args:\n",
    "        text (str): 要分块的完整文本\n",
    "        n (int): 每个块的字符数\n",
    "        overlap (int): 块之间的重叠字符数\n",
    "\n",
    "    Returns:\n",
    "        List[dict]: 包含 'header' 和 'text' 键的字典列表\n",
    "    \"\"\"\n",
    "    chunks = []\n",
    "\n",
    "    # 按指定的块大小和重叠量遍历文本\n",
    "    for i in range(0, len(text), n - overlap):\n",
    "        chunk = text[i:i + n]\n",
    "        header = generate_chunk_header(chunk)  # 使用 LLM 为块生成标题\n",
    "        chunks.append({\"header\": header, \"text\": chunk})  # 将标题和块添加到列表中\n",
    "\n",
    "    return chunks"
   ],
   "id": "c30baa76ec293c1e",
   "outputs": [],
   "execution_count": 5
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 从 PDF 文件中提取和分块文本",
   "id": "515f82f79fdaaca7"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:22.395915Z",
     "start_time": "2025-04-23T11:28:00.173820Z"
    }
   },
   "cell_type": "code",
   "source": [
    "extracted_text = extract_text_from_pdf(pdf_path)\n",
    "\n",
    "# Chunk the extracted text with headers\n",
    "# We use a chunk size of 1000 characters and an overlap of 200 characters\n",
    "text_chunks = chunk_text_with_headers(extracted_text, 1000, 200)\n",
    "\n",
    "# Print a sample chunk with its generated header\n",
    "print(\"Sample Chunk:\")\n",
    "print(\"Header:\", text_chunks[0]['header'])\n",
    "print(\"Content:\", text_chunks[0]['text'])"
   ],
   "id": "b04809f55d019f36",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample Chunk:\n",
      "Header: 人工智能基础：从概念到核心技术\n",
      "Content: 理解⼈⼯智能\n",
      "第⼀章：⼈⼯智能简介\n",
      "⼈⼯智能 (AI) 是指数字计算机或计算机控制的机器⼈执⾏通常与智能⽣物相关的任务的能⼒。该术\n",
      "语通常⽤于开发具有⼈类特有的智⼒过程的系统，例如推理、发现意义、概括或从过往经验中学习\n",
      "的能⼒。在过去的⼏⼗年中，计算能⼒和数据可⽤性的进步显著加速了⼈⼯智能的开发和部署。\n",
      "历史背景\n",
      "⼈⼯智能的概念已存在数个世纪，经常出现在神话和⼩说中。然⽽，⼈⼯智能研究的正式领域始于\n",
      "20世纪中叶。1956年的达特茅斯研讨会被⼴泛认为是⼈⼯智能的发源地。早期的⼈⼯智能研究侧\n",
      "重于问题解决和符号⽅法。20世纪80年代专家系统兴起，⽽20世纪90年代和21世纪初，机器学习\n",
      "和神经⽹络取得了进步。深度学习的最新突破彻底改变了这⼀领域。\n",
      "现代观察\n",
      "现代⼈⼯智能系统在⽇常⽣活中⽇益普及。从 Siri 和 Alexa 等虚拟助⼿，到流媒体服务和社交媒体\n",
      "上的推荐算法，⼈⼯智能正在影响我们的⽣活、⼯作和互动⽅式。⾃动驾驶汽⻋、先进的医疗诊断\n",
      "技术以及复杂的⾦融建模⼯具的发展，彰显了⼈⼯智能应⽤的⼴泛性和持续增⻓。此外，⼈们对其\n",
      "伦理影响、偏⻅和失业的担忧也⽇益凸显。\n",
      "第⼆章：⼈⼯智能的核⼼概念\n",
      "机器学习\n",
      "机器学习 (ML) 是⼈⼯智能的⼀个分⽀，专注于使系统⽆需明确编程即可从数据中学习。机器学习\n",
      "算法能够识别模式、做出预测，并随着接触更多数据⽽不断提升其性能。\n",
      "监督学习\n",
      "在监督学习中，算法基于标记数据进⾏训练，其中输⼊数据与正确的输出配对。这使得算法能够学\n",
      "习输⼊和输出之间的关系，并对新的、未知的数据进⾏预测。⽰例包括图像分类和垃圾邮件检测。\n",
      "⽆监督学习\n",
      "⽆监督学习算法基于未标记数据进⾏训练，算法必须在没有明确指导的情况下发现数据中的模式和\n",
      "结构。常⽤技术包括聚类（将相似的数据点分组）和降维（在保留重要信息的同时减少变量数\n",
      "量）。\n",
      "从英语翻译成中⽂(简体) - www.onlinedoctranslator.com\n",
      "强化学习\n",
      "强化学习涉及训练代理在特定环境中做出决策，以最⼤化奖励。代理通过反复试验进⾏学习，并以\n",
      "奖励或惩罚的形式接收反馈。这种⽅法应⽤于游戏、机器⼈技术和资源管理。\n",
      "深度学习\n",
      "深度学习是机器学习的⼀个⼦领域，它使⽤多层⼈⼯神经⽹络（深度神经⽹络）来分析数据。这些\n",
      "⽹络的设计灵感来源于⼈脑的结构和功能。深度学习在图像识别、⾃然语⾔处理和语⾳识别等领域\n",
      "\n"
     ]
    }
   ],
   "execution_count": 6
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 为标题和文本创建嵌入",
   "id": "6fd4b860fad921a0"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:22.424617Z",
     "start_time": "2025-04-23T11:28:22.420016Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def create_embeddings(texts):\n",
    "    \"\"\"\n",
    "    为文本列表生成嵌入\n",
    "\n",
    "    Args:\n",
    "        texts (List[str]): 输入文本列表.\n",
    "\n",
    "    Returns:\n",
    "        List[np.ndarray]: List of numerical embeddings.\n",
    "    \"\"\"\n",
    "    # 确保每次调用不超过64条文本\n",
    "    # batch_size = 64\n",
    "    # embeddings = []\n",
    "    #\n",
    "    # for i in range(0, len(texts), batch_size):\n",
    "    #     batch = texts[i:i + batch_size]\n",
    "    #     response = client.embeddings.create(\n",
    "    #         model=embedding_model,\n",
    "    #         input=batch\n",
    "    #     )\n",
    "    #     # 将响应转换为numpy数组列表并添加到embeddings列表中\n",
    "    #     embeddings.extend([np.array(embedding.embedding) for embedding in response.data])\n",
    "    #\n",
    "    # return embeddings\n",
    "    response = client.embeddings.create(\n",
    "        model=embedding_model,\n",
    "        input=texts\n",
    "    )\n",
    "    return response.data[0].embedding\n",
    "\n",
    "\n",
    "# response = create_embeddings(text_chunks)"
   ],
   "id": "245dbc81e0ac766c",
   "outputs": [],
   "execution_count": 7
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:27.134582Z",
     "start_time": "2025-04-23T11:28:22.442672Z"
    }
   },
   "cell_type": "code",
   "source": [
    "embeddings = []  # Initialize an empty list to store embeddings\n",
    "\n",
    "# Iterate through each text chunk with a progress bar\n",
    "for chunk in tqdm(text_chunks, desc=\"Generating embeddings\"):\n",
    "    # Create an embedding for the chunk's text\n",
    "    text_embedding = create_embeddings(chunk[\"text\"])\n",
    "    # print(text_embedding.shape)\n",
    "    # Create an embedding for the chunk's header\n",
    "    header_embedding = create_embeddings(chunk[\"header\"])\n",
    "    # Append the chunk's header, text, and their embeddings to the list\n",
    "    embeddings.append({\"header\": chunk[\"header\"], \"text\": chunk[\"text\"], \"embedding\": text_embedding,\n",
    "                       \"header_embedding\": header_embedding})"
   ],
   "id": "8fed734963fb5962",
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Generating embeddings: 100%|██████████| 13/13 [00:04<00:00,  2.78it/s]\n"
     ]
    }
   ],
   "execution_count": 8
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 语义搜索",
   "id": "d6f387ad8b7f1324"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:27.145364Z",
     "start_time": "2025-04-23T11:28:27.141250Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def cosine_similarity(vec1, vec2):\n",
    "    \"\"\"\n",
    "    Computes cosine similarity between two vectors.\n",
    "\n",
    "    Args:\n",
    "    vec1 (np.ndarray): First vector.\n",
    "    vec2 (np.ndarray): Second vector.\n",
    "\n",
    "    Returns:\n",
    "    float: Cosine similarity score.\n",
    "    \"\"\"\n",
    "\n",
    "    # Compute the dot product of the two vectors\n",
    "    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))"
   ],
   "id": "e45d3bebf4483f80",
   "outputs": [],
   "execution_count": 9
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:27.160482Z",
     "start_time": "2025-04-23T11:28:27.154794Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def semantic_search(query, chunks, k=5):\n",
    "    \"\"\"\n",
    "    根据查询搜索最相关的块\n",
    "\n",
    "    Args:\n",
    "    query (str): 用户查询\n",
    "    chunks (List[dict]): 带有嵌入的文本块列表\n",
    "    k (int): 返回的相关chunk数\n",
    "\n",
    "    Returns:\n",
    "    List[dict]: Top-k most relevant chunks.\n",
    "    \"\"\"\n",
    "    query_embedding = create_embeddings(query)\n",
    "    # print(query_embedding)\n",
    "    # print(query_embedding.shape)\n",
    "\n",
    "    similarities = []\n",
    "\n",
    "    # 遍历每个块以计算相似度分数\n",
    "    for chunk in chunks:\n",
    "        # Compute cosine similarity between query embedding and chunk text embedding\n",
    "        sim_text = cosine_similarity(np.array(query_embedding), np.array(chunk[\"embedding\"]))\n",
    "        # sim_text = cosine_similarity(query_embedding, chunk[\"embedding\"])\n",
    "\n",
    "        # Compute cosine similarity between query embedding and chunk header embedding\n",
    "        sim_header = cosine_similarity(np.array(query_embedding), np.array(chunk[\"header_embedding\"]))\n",
    "        # sim_header = cosine_similarity(query_embedding, chunk[\"header_embedding\"])\n",
    "        # 计算平均相似度分数\n",
    "        avg_similarity = (sim_text + sim_header) / 2\n",
    "        # Append the chunk and its average similarity score to the list\n",
    "        similarities.append((chunk, avg_similarity))\n",
    "\n",
    "    # Sort the chunks based on similarity scores in descending order\n",
    "    similarities.sort(key=lambda x: x[1], reverse=True)\n",
    "    # Return the top-k most relevant chunks\n",
    "    return [x[0] for x in similarities[:k]]"
   ],
   "id": "ee12f4e551e46871",
   "outputs": [],
   "execution_count": 10
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 查询",
   "id": "d27646d5b221fa92"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:27.467658Z",
     "start_time": "2025-04-23T11:28:27.168546Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# Load validation data\n",
    "with open('../../data/val.json', encoding=\"utf-8\") as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "query = data[0]['question']\n",
    "\n",
    "# Retrieve the top 2 most relevant text chunks\n",
    "top_chunks = semantic_search(query, embeddings, k=2)\n",
    "\n",
    "# Print the results\n",
    "print(\"Query:\", query)\n",
    "for i, chunk in enumerate(top_chunks):\n",
    "    print(f\"Header {i+1}: {chunk['header']}\")\n",
    "    print(f\"Content:\\n{chunk['text']}\\n\")"
   ],
   "id": "8ddde8ed9a251187",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: 什么是‘可解释人工智能’，为什么它被认为很重要？\n",
      "Header 1: 人工智能的未来发展与伦理挑战\n",
      "Content:\n",
      "问题也随之⽽来。为⼈⼯智能的开发\n",
      "和部署建⽴清晰的指导⽅针和道德框架⾄关重要。\n",
      "⼈⼯智能武器化\n",
      "⼈⼯智能在⾃主武器系统中的潜在应⽤引发了重⼤的伦理和安全担忧。需要开展国际讨论并制定相\n",
      "关法规，以应对⼈⼯智能武器的相关⻛险。\n",
      "第五章：⼈⼯智能的未来\n",
      "⼈⼯智能的未来很可能以持续进步和在各个领域的⼴泛应⽤为特征。关键趋势和发展领域包括：\n",
      "可解释⼈⼯智能（XAI）\n",
      "可解释⼈⼯智能 (XAI) 旨在使⼈⼯智能系统更加透明易懂。XAI 技术正在开发中，旨在深⼊了解⼈\n",
      "⼯智能模型的决策⽅式，从⽽增强信任度和责任感。\n",
      "边缘⼈⼯智能\n",
      "边缘⼈⼯智能是指在设备上本地处理数据，⽽不是依赖云服务器。这种⽅法可以减少延迟，增强隐\n",
      "私保护，并在连接受限的环境中⽀持⼈⼯智能应⽤。\n",
      "量⼦计算和⼈⼯智能\n",
      "量⼦计算有望显著加速⼈⼯智能算法，从⽽推动药物研发、材料科学和优化等领域的突破。量⼦计\n",
      "算与⼈⼯智能的交叉研究前景⼴阔。\n",
      "⼈机协作\n",
      "⼈⼯智能的未来很可能涉及⼈类与⼈⼯智能系统之间更紧密的协作。这包括开发能够增强⼈类能\n",
      "⼒、⽀持决策和提⾼⽣产⼒的⼈⼯智能⼯具。\n",
      "⼈⼯智能造福社会\n",
      "⼈⼯智能正⽇益被⽤于应对社会和环境挑战，例如⽓候变化、贫困和医疗保健差距。“⼈⼯智能造\n",
      "福社会”倡议旨在利⽤⼈⼯智能产⽣积极影响。\n",
      "监管与治理\n",
      "随着⼈⼯智能⽇益普及，监管和治理的需求将⽇益增⻓，以确保负责任的开发和部署。这包括制定\n",
      "道德准则、解决偏⻅和公平问题，以及保护隐私和安全。国际标准合作⾄关重要。\n",
      "通过了解⼈⼯智能的核⼼概念、应⽤、伦理影响和未来发展⽅向，我们可以更好地应对这项变⾰性\n",
      "技术带来的机遇和挑战。持续的研究、负责任的开发和周到的治理，对于充分发挥⼈⼯智能的潜⼒\n",
      "并降低其⻛险⾄关重要。\n",
      "第六章：⼈⼯智能和机器⼈技术\n",
      "⼈⼯智能与机器⼈技术的融合\n",
      "⼈⼯智能与机器⼈技术的融合，将机器⼈的物理能⼒与⼈⼯智能的认知能⼒完美结合。这种协同效\n",
      "应使机器⼈能够执⾏复杂的任务，适应不断变化的环境，并与⼈类更⾃然地互动。⼈⼯智能机器⼈\n",
      "⼴泛应⽤于制造业、医疗保健、物流和勘探领域。\n",
      "机器⼈的类型\n",
      "⼯业机器⼈\n",
      "⼯业机器⼈在制造业中⽤于执⾏焊接、喷漆、装配和物料搬运等任务。⼈⼯智能提升了它们的精\n",
      "度、效率和适应性，使它们能够在协作环境中与⼈类并肩⼯作（协作机器⼈）。\n",
      "服务机器⼈\n",
      "服务机器⼈协助⼈类完成各种任务，包括清洁、送货、客⼾服务和医疗\n",
      "\n",
      "Header 2: 人工智能在各行业的应用与伦理挑战\n",
      "Content:\n",
      "改变交通运输。⾃动驾驶\n",
      "汽⻋利⽤⼈  ⼯智能感知周围环境、做出驾驶决策并安全⾏驶。\n",
      "零售\n",
      "零售⾏业利⽤⼈⼯智能进⾏个性化推荐、库存管理、客服聊天机器⼈和供应链优化。⼈⼯智能系统\n",
      "可以分析客⼾数据，预测需求、提供个性化优惠并改善购物体验。\n",
      "制造业\n",
      "⼈⼯智能在制造业中⽤于预测性维护、质量控制、流程优化和机器⼈技术。⼈⼯智能系统可以监控\n",
      "设备、检测异常并⾃动执⾏任务，从⽽提⾼效率并降低成本。\n",
      "教育\n",
      "⼈⼯智能正在通过个性化学习平台、⾃动评分系统和虚拟导师提升教育⽔平。⼈⼯智能⼯具可以适\n",
      "应学⽣的个性化需求，提供反馈，并打造定制化的学习体验。\n",
      "娱乐\n",
      "娱乐⾏业将⼈⼯智能⽤于内容推荐、游戏开发和虚拟现实体验。⼈⼯智能算法分析⽤⼾偏好，推荐\n",
      "电影、⾳乐和游戏，从⽽增强⽤⼾参与度。\n",
      "⽹络安全\n",
      "⼈⼯智能在⽹络安全领域⽤于检测和应对威胁、分析⽹络流量以及识别漏洞。⼈⼯智能系统可以⾃\n",
      "动执⾏安全任务，提⾼威胁检测的准确性，并增强整体⽹络安全态势。\n",
      "第四章：⼈⼯智能的伦理和社会影响\n",
      "⼈⼯智能的快速发展和部署引发了重⼤的伦理和社会担忧。这些担忧包括：\n",
      "偏⻅与公平\n",
      "⼈⼯智能系统可能会继承并放⼤其训练数据中存在的偏⻅，从⽽导致不公平或歧视性的结果。确保\n",
      "⼈⼯智能系统的公平性并减少偏⻅是⼀项关键挑战。\n",
      "透明度和可解释性\n",
      "许多⼈⼯智能系统，尤其是深度学习模型，都是“⿊匣⼦”，很难理解它们是如何做出决策的。增\n",
      "强透明度和可解释性对于建⽴信任和问责⾄关重要。\n",
      "隐私和安全\n",
      "⼈⼯智能系统通常依赖⼤量数据，这引发了⼈们对隐私和数据安全的担忧。保护敏感信息并确保负\n",
      "责任的数据处理⾄关重要。\n",
      "⼯作岗位流失\n",
      "⼈⼯智能的⾃动化能⼒引发了⼈们对⼯作岗位流失的担忧，尤其是在重复性或常规性任务的⾏业。\n",
      "应对⼈⼯智能驱动的⾃动化带来的潜在经济和社会影响是⼀项关键挑战。\n",
      "⾃主与控制\n",
      "随着⼈⼯智能系统⽇益⾃主，控制、问责以及潜在意外后果的问题也随之⽽来。为⼈⼯智能的开发\n",
      "和部署建⽴清晰的指导⽅针和道德框架⾄关重要。\n",
      "⼈⼯智能武器化\n",
      "⼈⼯智能在⾃主武器系统中的潜在应⽤引发了重⼤的伦理和安全担忧。需要开展国际讨论并制定相\n",
      "关法规，以应对⼈⼯智能武器的相关⻛险。\n",
      "第五章：⼈⼯智能的未来\n",
      "⼈⼯智能的未来很可能以持续进步和在各个领域的⼴泛应⽤为特征。关键趋势和发展领域包括：\n",
      "可解释⼈⼯智能（XAI）\n",
      "可解释⼈⼯智能 (XAI) 旨在使⼈⼯智\n",
      "\n"
     ]
    }
   ],
   "execution_count": 11
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 基于检索到的片段生成回答",
   "id": "c2e4b777d5d3812"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:48.806088Z",
     "start_time": "2025-04-23T11:28:27.488005Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# AI 助手的系统提示\n",
    "system_prompt = \"你是一个AI助手，严格根据给定的上下文进行回答。如果无法直接从提供的上下文中得出答案，请回复：'我没有足够的信息来回答这个问题。'\"\n",
    "\n",
    "def generate_response(system_prompt, user_prompt):\n",
    "    \"\"\"\n",
    "    基于检索到的文本块生成 AI 回答。\n",
    "\n",
    "    Args:\n",
    "    retrieved_chunks (List[str]): 检索到的文本块列表\n",
    "    model (str): AI model.\n",
    "\n",
    "    Returns:\n",
    "    str: AI-generated response.\n",
    "    \"\"\"\n",
    "    # Generate the AI response using the specified model\n",
    "    response = client.chat.completions.create(\n",
    "        model=os.getenv(\"LLM_MODEL_ID\"),\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    # Return the content of the AI response\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "# 将检索到的文本块合并为一个上下文字符串\n",
    "context = \"\\n\".join([f\"Context {i+1}:\\n{chunk}\" for i, chunk in enumerate(top_chunks)])\n",
    "\n",
    "# 通过组合上下文和查询创建用户提示\n",
    "user_prompt = f\"{context}\\n\\nQuestion: {query}\"\n",
    "ai_response = generate_response(system_prompt, user_prompt)\n",
    "print(\"AI Response:\\n\", ai_response)"
   ],
   "id": "5e4b77be75e8816a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AI Response:\n",
      " 可解释人工智能（XAI）旨在使人工智能系统更加透明易懂。XAI 技术正在开发中，旨在深入了解人工智能模型的决策方式，从而增强信任度和责任感。它被认为很重要，因为许多人工智能系统，尤其是深度学习模型，都是“黑匣子”，很难理解它们是如何做出决策的。增强透明度和可解释性对于建立信任和问责至关重要。\n"
     ]
    }
   ],
   "execution_count": 12
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 评估",
   "id": "9de5efb55325a124"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-23T11:28:52.386203Z",
     "start_time": "2025-04-23T11:28:48.833297Z"
    }
   },
   "cell_type": "code",
   "source": [
    "evaluate_system_prompt = \"你是一个智能评估系统，负责评估AI助手的回答。如果AI助手的回答与真实答案非常接近，则评分为1。如果回答错误或与真实答案不符，则评分为0。如果回答部分符合真实答案，则评分为0.5。\"\n",
    "\n",
    "evaluation_prompt = f\"用户问题: {query}\\nAI回答:\\n{ai_response}\\nTrue Response: {data[0]['ideal_answer']}\\n{evaluate_system_prompt}\"\n",
    "\n",
    "# Generate the evaluation response using the evaluation system prompt and evaluation prompt\n",
    "evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)\n",
    "print(evaluation_response)"
   ],
   "id": "e7444e2afdfa32b",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 \n",
      "\n",
      "AI助手的回答与真实答案非常接近。两者都强调了可解释人工智能（XAI）的目标是提高透明度和可理解性，以及其重要性在于建立信任、问责制和公平性。AI助手的回答还额外提到了深度学习模型的“黑匣子”特性，这进一步丰富了回答内容。因此，评分为1。\n"
     ]
    }
   ],
   "execution_count": 13
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
